Abstract

Protein sequences are believed to have been selected to provide the sta- 
bility of, and reliable renaturation to, an encoded unique spatial fold. In 
recently proposed theoretical schemes, this selection is modeled as "minimal 
frustration," or "optimal energy" of the desirable target conformation over 
all possible sequences, such that the "design" of the sequence is governed by 
the interactions between monomers. With replica mean field theory, we ex- 
amine the possibility to reconstruct the renaturation, or freezing transition, 
of the "designed" heteropolymer given the inevitable errors in the determi- 
nation of interaction energies, that is, the difference between sets (matrices) 
of interactions governing chain design and conformations, respectively. We 
find that the possibility of folding to the designed conformation is controlled 
by the correlations of the elements of the design and renaturation interaction 
matrices; unlike random heteropolymers, the ground state of designed het- 
eropolymers is sufficiently stable, such that even a substantial error in the 
interaction energy should still yield correct renaturation. 
I. INTRODUCTION
A. What is this work about? 

The native state of a protein is in a sense "written" in the sequence using the "lan- 
guage" of physical interactions between monomers. In this work, we examine the effects of 
"misunderstandings" and "misspellings" of this language. 

A somewhat related question was recently discussed by Bryngelson [?]. He considered
heteropolymer chains with random sequences and estimated the probability that the lowest
energy conformation will be correctly determined by a model with noisy, distorted potentials of
volume interactions between monomers. The result is that the probability, $p$, diminishes with
the noise amplitude, $\eta$, as $p \sim 1 - \mathrm{const} \cdot \eta^2 N^{\ldots}$, so that for a sufficiently long chain, or in the
thermodynamic limit, there is no chance to compute the equilibrium conformation, given that some
mistakes in the determination of energies are inevitable.

By contrast, we consider here heteropolymer chains with sequences that are not random,
but rather "designed" [?], or "imprinted" [?], or "selected" [?], or, in other words,
obey the so-called principle of minimal frustration [?]. We show that for these chains the
situation is dramatically different: there is a finite probability of successful recovery of the
thermodynamically stable conformation, even in the thermodynamic limit ($N \to \infty$) and for
finite, non-vanishing $\eta$.

As the work is based on rather heavy theoretical machinery, we begin with a more general
introduction.

B. Protein folding in a statistical mechanics perspective: a brief summary 

Protein folding is one of the great challenges of modern biophysics. Recently, there has
been much progress and insight into protein folding from the statistical physics perspective.
This came mainly with the ideas borrowed from the physics of disordered systems, such as
spin glasses [?]. It is now widely believed that folding of a protein chain can be viewed as a
freezing phase transition, as in spin glasses. The sequence of types of monomers along the
chain leads to quenched disorder, and polymer bonds between monomers impose frustrations.
The concept of a freezing transition seems to resolve one of the mysteries surrounding the
problem, namely, why the native state of a protein is realized via a unique conformation, at
least in a coarse-grained sense.

One of the main differences between proteins (and biopolymers in general) and other
systems more familiar to physicists is the way in which quenched disorder appears. While
in regular spin glasses and similar "un-animated" systems the appearance of any particular
realization of the disorder is beyond any control, in biology the situation is dramatically
different. First, the biosynthesis of proteins (as well as other biopolymers) manages to
produce macroscopic amounts of identical copies, a situation unthinkable in other disordered
systems. Furthermore, the sequences of existing real proteins are believed to be the
product of evolution and thus are of great interest. Do they have some distinct properties
compared to other possible sequences? Do they provide some general properties of proteins,
such as their ability to form a unique spatial fold, or only the particular conformation and/or
function of each particular protein?

To this end, it now seems well established that the ability of a polymer chain to undergo
a freezing phase transition into a state with a unique (or almost unique) conformation
is common to many models of random heteropolymers. In other words, to achieve the
uniqueness of the ground state conformation one does not have to impose any requirements
on the sequence, which can therefore be chosen even at random. This was first shown for
the so-called independent-interaction model [?] (which is in a sense even more random
than a random heteropolymer: instead of $N$ independent monomers, it has about $\sim N^2$
independent monomer-to-monomer interaction constants). This argument was extended to
random copolymers with two types of monomers [?] and brought into the most general form
in Ref. [?].



In an independent (and even earlier) development, the hypothesis was made that the
sequences of real proteins deviate from random in such a way as to satisfy the requirement
of the so-called "minimal frustration principle" [?]. In a sense, this idea goes back to even
earlier ideas of Abe [?]. Recent developments, related mainly to computer Monte Carlo
simulations of freezing kinetics, give insight into the role of minimal frustration as a
factor pulling down the energy of the ground state conformation and thus providing the gap
in the energy spectrum necessary for reliable folding [?].

Thus, it seems that kinetic reliability of folding requires that sequences of proteins are
not random, but "edited." Statistical analysis indeed reveals that, although the sequences
are close to random [12], systematic deviations from randomness do exist [?], and they
are at least compatible with the idea of energy optimization of the ground state.

The question arises of how to model the ensemble of sequences more realistically than by
taking them at random. Two approaches have recently been suggested. Both employ physical
interactions between monomers to build up the polymeric sequence, and both are based on
energy optimization of the native conformation. The first approach, due to Shakhnovich and
Gutin [10], implies the search for an optimum in sequence space by swapping the monomers
along the chain while preserving the (native) conformation. Apart from the speculative
possibility to model evolution, this is obviously intended for computer simulation. On the other
hand, we suggested energy optimization in the "monomer soup" prior to polymerization [?].
Although these two models are considerably different in spirit, they appear to be identical
from the point of view of mean-field theoretical treatment.

The freezing transition of an ensemble of sequences which have been energetically
optimized ("designed") for a particular conformation was examined for the black-and-white
model (with two types of monomers) in Refs. [14,15]; the analysis was extended to the
general case in Ref. [16]. For what follows, it is important that three globular (that is, compact)
phases were found on the phase diagram of the thus "designed" or "imprinted" polymer: we call
them random, frozen and target. The random globular phase is much the same as
the globule of a homopolymer; it is comprised of a vast number of conformations (all of which
are compact). By contrast, the frozen and target phases are each comprised of one or very few
conformations. In the target phase, this unique conformation is exactly the one which is
targeted by the design procedure. On the other hand, in the frozen phase, the system freezes
to a conformation which is unrelated to the design conformation and therefore cannot be
controlled.



C. Sequence design and folding are governed by different interactions 

As sequence design is based on energy optimization, it employs physical interactions
between monomers. It is, however, possible, and moreover almost inevitable, that these
interactions are somewhat different from those governing folding. Apart from speculations
on the interactions that governed the "design" of modern proteins by evolution, we mention
four illustrations of our thesis:

1. When one tries to find theoretically or computationally the native state for a chain
with a given sequence (direct protein folding problem), one can say that nature determines
the interactions used in the design of protein sequences, while man-made potentials are
used as substitutes in the simulations of renaturation.

2. Similarly, when one is looking for a sequence to fold into a given conformation, one is
essentially trying to design the sequence using artificial potentials in such a way that
this sequence, under real natural interactions, will fold in the desired way.

3. Speaking of the attempt to reproduce protein-like properties in a man-made
heteropolymer via the Imprinting procedure [?], we have to acknowledge some difference
between the interactions of monomers in the soup prior to polymerization and the interactions
of the links of the polymer.

4. One can also consider the renaturation of a protein in a solvent different from that used
during "design" as an experiment in which the interactions during design and
renaturation are different.

If there are, say, $q$ different monomeric species involved in our polymer ($q = 20$ for proteins),
interactions between species $i$ and $j$ can be described in terms of the $q \times q$ matrix $B_{ij}$. In
general, there are two different matrices, $B^p_{ij}$ and $B_{ij}$: the first governs preparation and the
second governs the folding behavior of the already prepared chain.

To have two different interaction matrices for design and renaturation is somewhat similar
to a writer and a reader who use different languages. Naprimer, my nadeemsya, chto nash
chitatel' schitaet etot tekst napisannym po-angliiski i poetomu vryad li poimet etu frazu
[18]. Clearly, such a venture has a chance if and only if those languages are not completely
different, but merely dialects of one language. Similarly, infinitesimally small changes to the
interaction matrix should not have any significant ramifications, while, on the other hand,
a radically different matrix structure should lead to completely different folding behavior.
Using the terminology of frozen and target phases, we can ask whether a chain designed with
some matrix $B^p_{ij}$ will freeze to the target state when governed by another matrix $B_{ij}$. In other
words, if we want to get the target phase, how accurate should we be in choosing the matrices
$B^p_{ij}$ and $B_{ij}$? Another interesting aspect of the question is which properties of the $B^p_{ij}$ and $B_{ij}$
matrices are important, that is, to which of them is the chain behavior sensitive? And what
measure do we use to define the proximity of interaction matrices?

Previous treatments [?,?] have addressed certain aspects of these questions. However,
they differ from the present work in that we model the effect of evolutionary optimization and
the nullification of this by errors in the interaction potentials, whereas Refs. [17,?] examine
the stability of glassy (not evolutionarily optimized) conformations with respect to errors.

II. THE MODEL 

We start from a heteropolymer chain Hamiltonian in which interactions are described in
terms of the energy of interaction of species,

$$\mathcal{H} = \sum_{I,J}^{N} B_{s_I s_J}\,\delta(\mathbf{r}_I - \mathbf{r}_J), \tag{1}$$

where $B_{ij}$ is the interaction energy between monomer species $i$ and $j$ ($i, j \in \{1, \ldots, q\}$), $s_I$ is
the species of the monomer at position $I$ along the chain, $N$ is the number of monomers, and
$\mathbf{r}_I$ is the position of monomer $I$. We use the convention that lower case roman letters label
species space, upper case roman letters label monomer number along the chain, and lower
case greek letters label replicas.

We do not explicitly include in the Hamiltonian (1) anything leading to the overall
collapse of the chain. We do imply, however, the existence of some strong compressing factor,
such as an overall homopolymeric-type poor solvent effect (expressed with $\mathcal{H}' = B\rho^2 + C\rho^3$ with
species-independent $B$ and $C$ and strongly negative $B$) or a box-like external field, such that
the polymer is always in a globular conformation. The particular choice of compressing
factor is known to be unimportant [?], provided that the chain is long enough; we do not
discuss here any finite size effects related to the surface of the globule, even though these
might be important for real proteins. Furthermore, we stress that it is of vital importance
for the entire approach that the chain is maintained in the globular compact state (compare
with Ref. [?], where the design scheme failed to work just because the requirement of an overall
collapsed state was relaxed).

Since the heteropolymer sequence does not change during folding, we immediately
encounter the technical problem that sequences are a quenched quantity, and thus we average
the free energy over all sequences (with a particular weighting due to design) rather than
the partition function. This leads directly to the replica approach. The details of the
corresponding calculation are similar to what is presented elsewhere [?]. Here we briefly outline
the main steps. The replicated partition function can be symbolically written as

$$\langle Z^n \rangle = \sum_{\text{sequence}} P_{\text{sequence}} \sum_{\{\text{conformations}\}} \exp\left[ -\sum_{\alpha=1}^{n} \mathcal{H}(\text{sequence}, \text{conformation}_\alpha)/T \right], \tag{2}$$



where we explicitly mention the dependence of the Hamiltonian (1) on both the sequence, which
is the same for all replicas $\alpha \in \{1, \ldots, n\}$, and the conformation, which is potentially different for
different replicas. The probability distribution over the set of sequences, $P_{\text{sequence}}$, is defined by
the preparation process and thus in our case can be written as

$$P_{\text{sequence}} = \left[ p_{s_1} \cdot p_{s_2} \cdots p_{s_N} \right] \times \sum_{\text{target conformation}} \exp\left[ -\mathcal{H}^p(\text{sequence}, \text{target conformation})/T_p \right], \tag{3}$$

where we drop the normalization factor. In this equation, $p_s$ is the probability of
appearance of the monomer species $s$ (which is normally controlled by the chemical potentials of
the components in the monomer soup surrounding the preparation bath), and $\mathcal{H}^p$ is a Hamiltonian of
the form (1), except with the "preparation" matrix $\hat{B}^p$ instead of the matrix $\hat{B}$ which controls folding
through equation (2). Accordingly, $T_p$ is the temperature at which the preparation process is
performed.

We stress that our approach is not restricted to any particular target conformation. By
contrast, we average over all possible (compact) target conformations (see equation (3)),
and thus our scheme picks up not just the good sequences, but the pairs "target conformation
- sequence which is good for this target conformation," where both terms are well adjusted
to each other (see also the discussion in Ref. [?]). This is a good match for Imprinting,
since we assume that some external field chooses sequence-conformation pairs based upon
matching with the field [?]. Indeed, this may be analogous to protein evolution, in which
nature chooses sequence-conformation pairs not for any specific nature of the conformation
or sequence but for its functionality; this can be viewed in physical terms as some external
field affecting the selection of sequence and conformation [20].



III. FREE ENERGY OF THE MODEL 



Inspection of equations (2, 3) indicates that we can formally express the weight
corresponding to the design process as an additional replica labeled $\alpha = 0$ [14-16,21]:



$$\langle Z^n \rangle = \sum_{\text{sequence}} \prod_{I=1}^{N} p_{s_I} \sum_{\{\text{conformations}\}} \exp\left[ -\sum_{\alpha=0}^{n} \frac{1}{T_\alpha} \sum_{I,J=1}^{N} B^{\alpha}_{s_I s_J}\,\delta(\mathbf{r}^{\alpha}_I - \mathbf{r}^{\alpha}_J) \right], \tag{4}$$

where $B^{\alpha=0}_{ij} = B^p_{ij}$ is the matrix which expresses the interactions used for the chain
preparation (i.e. replica $\alpha = 0$, with $T_0 = T_p$) and $B^{\alpha>0}_{ij} = B_{ij}$ is the interaction matrix
which governs folding or renaturation (with $T_{\alpha>0} = T$). Hereafter, conformations are given in
terms of position vectors $\mathbf{r}^{\alpha}_I$ for each monomer number $I$ and each replica $\alpha$. By the sum over
conformations we mean the sum in which the condition of chain connectivity is strictly obeyed
(technically this can be done either in continuous form as Edwards [?] or in discrete form like
Lifshits [?]).
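To facilitate averaging over the sequences, we define the densities

$$\rho^{\alpha}_i(\mathbf{R}) = \sum_{I}^{N} \delta_{s_I, i}\,\delta(\mathbf{r}^{\alpha}_I - \mathbf{R}), \tag{5}$$

then rewrite the exponent in equation (4) as

$$\sum_{I,J}^{N} B^{\alpha}_{s_I s_J}\,\delta(\mathbf{r}^{\alpha}_I - \mathbf{r}^{\alpha}_J) = \int d\mathbf{R} \sum_{ij} \rho^{\alpha}_i(\mathbf{R})\, B^{\alpha}_{ij}\, \rho^{\alpha}_j(\mathbf{R}), \tag{6}$$

and perform a Hubbard-Stratonovich transformation on the quantity $\rho^{\alpha}_i(\mathbf{R})$, thus
introducing the conjugate field $\phi^{\alpha}_i(\mathbf{R})$. We average over the sequence and truncate the resulting
exponent to $O(\phi^2)$, which yields (see the details in Ref. [16]):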



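$$\langle Z^n \rangle = \sum_{\text{conformations}} \int \mathcal{D}\{\phi\} \exp\Bigg\{ \int d\mathbf{R} \sum_{\alpha=0}^{n} \sum_{i} \rho_\alpha(\mathbf{R})\, p_i\, \phi^{\alpha}_i(\mathbf{R}) + \int d\mathbf{R}_1 d\mathbf{R}_2 \sum_{\alpha,\beta=0}^{n} \sum_{ij} \left[ \frac{T_\alpha}{4} \left(\hat{B}^{\alpha}\right)^{-1}_{ij} \delta_{\alpha\beta}\, \delta(\mathbf{R}_1 - \mathbf{R}_2) + \frac{1}{2} \Delta_{ij}\, Q_{\alpha\beta}(\mathbf{R}_1, \mathbf{R}_2) \right] \phi^{\alpha}_i(\mathbf{R}_1)\, \phi^{\beta}_j(\mathbf{R}_2) \Bigg\}, \tag{7}$$

where $\Delta_{ij} = p_i \delta_{ij} - p_i p_j$ (with $T_0 = T_p$ and $T_{\alpha>0} = T$), and where we define the overall
density $\rho_\alpha(\mathbf{R}) = \sum_I^N \delta(\mathbf{r}^{\alpha}_I - \mathbf{R}) = \sum_{i=1}^{q} \rho^{\alpha}_i(\mathbf{R})$ and the replica
overlap order parameter $Q_{\alpha\beta}(\mathbf{R}_1, \mathbf{R}_2) = \sum_I^N \delta(\mathbf{r}^{\alpha}_I - \mathbf{R}_1)\,\delta(\mathbf{r}^{\beta}_I - \mathbf{R}_2)$. Since the density is a
single replica quantity and we assume the chain as a whole is compressed, that is, the density
is constant throughout the globule, we simply take $\rho_\alpha(\mathbf{R}) = \rho$. Furthermore, using a
variational argument, it was shown [?,14] that freezing occurs down to microscopic length scales,
thus allowing us to take $Q_{\alpha\beta}(\mathbf{R}_1, \mathbf{R}_2) = \rho\, q_{\alpha\beta}\, \delta(\mathbf{R}_1 - \mathbf{R}_2)$, where the form of the conformation
correlator $q_{\alpha\beta}$ is found to be that of a Parisi matrix with one-step symmetry breaking, with
either complete overlap ($q_{\alpha\beta} = 1$) or no overlap ($q_{\alpha\beta} = 0$). (This directly corresponds with
the Random Energy Model [?] introduced directly in previous heteropolymer models [?].)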
This facilitates Gaussian integration over the fields. To write the result in an even
simpler form, we can also include a conformation-independent constant by a linear
transformation of the fields involving the matrices $\left( \rho \hat{B}^{\alpha}/T_\alpha \right)^{1/2}$, to get

[Equation (8): Gaussian functional integral over the transformed fields $\{\phi\}$; the explicit expression is not legible in the source.]

where we use a hat to indicate that the object is a matrix in species space (i.e. $\hat{A} = A_{ij}$). We
evaluate this Gaussian integral, yielding the free energy



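[Equation (9): $\langle Z^n \rangle$ written as a sum over conformations; the explicit expression is not legible in the source.]

where the effective energy of the $n$ replica system is given by

[Equation (10): effective energy, consisting of an $\ln\det$ term over species and replica space and a scalar-product term in $\vec{\rho}$; the explicit expression is not legible in the source.]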



Here $\langle \cdot | \cdots | \cdot \rangle$ denotes the scalar product over species space, the determinant in the first term
is over species and replica space, and the vector $\vec{\rho}$ is given by $\rho_i = \rho p_i$. Note that the
only remaining dependence on conformations comes through the conformational correlators $q_{\alpha\beta}$.
Given the particular structure of $q_{\alpha\beta}$, the effective energy (10) can be expressed directly in terms
of the number $y$ of replicas which overlap with the target, and the size $x$ of a group
for the remaining $n - y$ replicas, divided into $(n - y)/x$ groups. Thus, we can simplify the
expression for the effective energy (10) by removing replica dimensionalities, as is performed in
Appendix A. This also allows one to write the entropy of the macrostate with given $x$ and
$y$, as it is associated simply with the grouping of replicas: $S = Ns[y + (n - y)(x - 1)/x]$. For each
replica constrained to an identical conformation down to the microscopic scale related to the
volume $v$, there is an entropy loss of $s = \ln(a^3/v)$ per monomer, where $a$ is the distance between
monomers and $v$ is the excluded volume [?]. This allows conversion from the sum over
conformations to a functional integral over $Q_{\alpha\beta}(\mathbf{R}_1, \mathbf{R}_2)$, and even further, to a conventional
integral over $x$ and $y$, which, in the mean field approximation, can be further simplified to
optimization of the effective $n$-replica free energy



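$$\frac{F(x,y)}{NT} = \frac{n-y}{2x} \left\{ \ln\det\left[ \hat{I} + 2x\hat{\Delta}\hat{B}/T \right] + \left\langle \vec{\rho} \left| \left( 2x\hat{B}/T \right) \left[ \hat{I} + 2x\hat{\Delta}\hat{B}/T \right]^{-1} \right| \vec{\rho} \right\rangle \right\} + \frac{1}{2} \ln\det\left[ \hat{I} + 2\hat{\Delta}\hat{B}^p/T_p + 2y\hat{\Delta}\hat{B}/T \right] + \left\langle \vec{\rho} \left| \left( \hat{B}^p/T_p + y\hat{B}/T \right) \left[ \hat{I} + 2\hat{\Delta}\hat{B}^p/T_p + 2y\hat{\Delta}\hat{B}/T \right]^{-1} \right| \vec{\rho} \right\rangle - s\left[ y + (n-y)(x-1)/x \right] \tag{11}$$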



IV. ANALYSIS OF THE FREE ENERGY AND PHASE DIAGRAM 

The expression (11) is rather similar to what we had in Ref. [16] while considering the
model with identical interactions for design and folding and, of course, it reduces exactly
to the corresponding equation of that work when $\hat{B} = \hat{B}^p$. Furthermore, this expression
implies the same structure of the phase diagram, with the same three globular phases: random,
frozen, and target. (We remind the reader that overall collapse of the chain is the necessary
pre-condition of our approach, and thus the globule-to-coil phase transition falls outside the
framework of the present study.) To see the structure of the phase diagram, we first look at the
allowed variations of the order parameters $x$ and $y$.

For simplicity, we consider here only the small $s$ regime. In this case, freezing transitions,
which are the main topic of our interest here, occur when $\hat{B}$ is (in a reasonable sense) also
small. Indeed, freezing phase transitions result physically from the competition between the
energetic and entropic parts of the free energy (11), where the energetic part favors gathering of
replicas into groups while the entropic part favors diversity of replicas. For the energy to be
competitive with the entropy when $s$ is small, $\hat{B}$ must be small as well. This allows one to simplify
equation (11), truncating it to quadratic order in $\hat{B}$.
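The truncation relies on the standard small-matrix expansions of $\ln\det$ and of the matrix inverse. As a sketch, with $\hat{M}$ a placeholder symbol (not used in the source) for any matrix proportional to $\hat{B}$, and assuming the first term of the free energy has the form $(1/2x)\ln\det[\hat{I} + 2x\hat{\Delta}\hat{B}/T]$:

```latex
\ln\det(\hat I + \hat M) = \mathrm{Tr}\,\hat M - \tfrac{1}{2}\,\mathrm{Tr}\,\hat M^2 + O(\hat M^3),
\qquad
(\hat I + \hat M)^{-1} = \hat I - \hat M + O(\hat M^2).

% Illustration with \hat M = 2x \hat\Delta \hat B / T:
\frac{1}{2x}\ln\det\!\left[\hat I + \frac{2x}{T}\hat\Delta\hat B\right]
\simeq \mathrm{Tr}\!\left[\frac{\hat\Delta\hat B}{T} - \frac{x\,\hat\Delta\hat B\hat\Delta\hat B}{T^2}\right].
```

The linear term is independent of $x$, so only the quadratic term competes with the entropy of grouping.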

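As $y$ is the number of replicas whose conformation coincides with the target conformation,
this value must be between 0 and $n$. What is relevant in the replica approach is the $n \to 0$ limit,
and, moreover, only the terms which are linear in $n$ are to be considered (because higher
order terms disappear in the main equation $\langle \ln Z \rangle = \lim_{n \to 0} (\langle Z^n \rangle - 1)/n$). Accordingly,
since $0 \le y \le n$, we must linearize the free energy in $y$ as well [?,?]. This leads to a further
simplification of (11):

$$F = NT\,\mathrm{Tr}\Big[ (n-y)\left\{ \hat{\Delta}\hat{B}/T - x\hat{\Delta}\hat{B}\hat{\Delta}\hat{B}/T^2 + \hat{P}\hat{B}/T - 2x\hat{P}\hat{B}\hat{\Delta}\hat{B}/T^2 \right\}
+ \hat{\Delta}\hat{B}^p/T_p + y\hat{\Delta}\hat{B}/T - 2y\hat{\Delta}\hat{B}^p\hat{\Delta}\hat{B}/TT_p - \hat{\Delta}\hat{B}^p\hat{\Delta}\hat{B}^p/T_p^2
+ \hat{P}\hat{B}^p/T_p + y\hat{P}\hat{B}/T - 2\left( \hat{P}\hat{B}^p\hat{\Delta}\hat{B}^p/T_p^2 + y\hat{P}\hat{B}\hat{\Delta}\hat{B}^p/TT_p + y\hat{P}\hat{B}^p\hat{\Delta}\hat{B}/TT_p \right) \Big]
- TNs\left[ y + (n-y)(x-1)/x \right], \tag{12}$$

where $P_{ij} = p_i p_j$.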

While $y$ describes the breaking of the symmetry between the $n$ replicas due to their attraction
to the target replica labeled 0, $x$ describes spontaneous symmetry breaking. When we have an
integer number of replicas, $n$, clearly $1 \le x \le n$: $x$ cannot be smaller than unity, because it
is the number of replicas in the group. When $n \to 0$, the logic about the number of replicas
in the group is not applicable any more, but it is natural to think that the formal inequalities
for $x$ simply flip signs: $n \le x \le 1$. With this in mind, we optimize the free energy (12)
with respect to $x$, yielding the equation which determines $x$:

$$s = \frac{x^2}{T^2}\,\mathrm{Tr}\left[ \hat{\Delta}\hat{B}\hat{\Delta}\hat{B} + 2\hat{P}\hat{B}\hat{\Delta}\hat{B} \right]. \tag{13}$$



Note that this equation does not involve either $T_p$ or $\hat{B}^p$, and thus it does not depend on the
preparation process. This has a clear physical meaning: it reflects behavior
similar to that of the REM, because the designed sequence behaves precisely as a random one
in all conformations except for the target conformation.

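At this point, it is useful to introduce the following matrix "cumulants":

$$\langle \hat{A} \rangle_c \equiv \sum_{ij} p_i A_{ij} p_j = \mathrm{Tr}\left( \hat{P}\hat{A} \right),$$
$$\langle \hat{A}\hat{B} \rangle_c \equiv \sum_{ij} p_i p_j A_{ij} B_{ij} - \langle \hat{A} \rangle_c \langle \hat{B} \rangle_c = \mathrm{Tr}\left( \hat{\Delta}\hat{A}\hat{\Delta}\hat{B} + 2\hat{P}\hat{A}\hat{\Delta}\hat{B} \right), \tag{14}$$

where $\hat{A}$ and $\hat{B}$ are arbitrary matrices.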
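The two forms of the second cumulant can be checked numerically. The sketch below is only an illustration: the function names, the three-species composition and the matrix entries are made up, the matrices are taken symmetric, and the form $\Delta_{ij} = p_i\delta_{ij} - p_i p_j$ is an assumption, since this section does not state it explicitly.

```python
def outer(p):
    """P_ij = p_i p_j."""
    return [[pi * pj for pj in p] for pi in p]

def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def trace(X):
    return sum(X[i][i] for i in range(len(X)))

def first_cumulant(p, A):
    """<A>_c = sum_ij p_i A_ij p_j."""
    n = len(p)
    return sum(p[i] * A[i][j] * p[j] for i in range(n) for j in range(n))

def second_cumulant(p, A, B):
    """<AB>_c = sum_ij p_i p_j A_ij B_ij - <A>_c <B>_c."""
    n = len(p)
    raw = sum(p[i] * p[j] * A[i][j] * B[i][j] for i in range(n) for j in range(n))
    return raw - first_cumulant(p, A) * first_cumulant(p, B)

def second_cumulant_trace(p, A, B):
    """Same quantity written as Tr(Delta A Delta B + 2 P A Delta B).

    Assumption: Delta_ij = p_i delta_ij - p_i p_j.
    """
    n = len(p)
    P = outer(p)
    Delta = [[(p[i] if i == j else 0.0) - P[i][j] for j in range(n)]
             for i in range(n)]
    DB = matmul(Delta, B)
    term1 = trace(matmul(matmul(Delta, A), DB))
    term2 = trace(matmul(matmul(P, A), DB))
    return term1 + 2.0 * term2

# Made-up example: three species with unequal composition, symmetric matrices.
p = [0.5, 0.3, 0.2]
A = [[-1.0, 0.4, 0.2], [0.4, 0.3, -0.6], [0.2, -0.6, 0.8]]
B = [[0.5, -0.2, 0.1], [-0.2, -0.7, 0.3], [0.1, 0.3, 0.9]]

direct = second_cumulant(p, A, B)
via_trace = second_cumulant_trace(p, A, B)
```

With symmetric matrices the two numbers agree to rounding error, which is the content of the second line of the definition above.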

From the above, we can easily find the equation for the freezing temperature for random
sequences. Indeed, freezing occurs when replicas start to group, thus spontaneously breaking
the permutation symmetry. This happens when $x = 1$. Therefore, the freezing temperature is
given via the relation

$$T_f^2 = \langle \hat{B}\hat{B} \rangle_c / s. \tag{15}$$

In other words, the freezing temperature is given by the variance of the renaturation
interaction matrix [?]. Note that this is a transition to a unique ground state which is not
necessarily (and most likely not) the target conformation: we call this phase the frozen phase,
and we call the high temperature disordered phase, in which there is no form of freezing, i.e.
many conformations dominate equilibrium, the random phase.
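As a minimal numerical sketch of this relation, the snippet below evaluates $T_f = (\langle \hat{B}\hat{B} \rangle_c / s)^{1/2}$ and the group size $x$ for a made-up two-letter model. The function names, the matrix and the value of $s$ are illustrative assumptions; the rule $x = T/T_f$ below freezing follows from solving the stationarity condition for $x$, with $x$ pinned at 1 above $T_f$.

```python
import math

def second_cumulant(p, A, B):
    """<AB>_c = sum_ij p_i p_j A_ij B_ij - <A>_c <B>_c."""
    n = len(p)
    mean_a = sum(p[i] * A[i][j] * p[j] for i in range(n) for j in range(n))
    mean_b = sum(p[i] * B[i][j] * p[j] for i in range(n) for j in range(n))
    raw = sum(p[i] * p[j] * A[i][j] * B[i][j] for i in range(n) for j in range(n))
    return raw - mean_a * mean_b

def freezing_temperature(p, B, s):
    """T_f from T_f^2 = <BB>_c / s."""
    return math.sqrt(second_cumulant(p, B, B) / s)

def group_size(T, T_f):
    """x = T / T_f in the frozen phase, x = 1 in the random phase."""
    return min(1.0, T / T_f)

# Made-up two-letter example: like species attract, unlike species repel.
p = [0.5, 0.5]
B = [[-1.0, 1.0], [1.0, -1.0]]
s = 0.25  # illustrative entropy loss per monomer

T_f = freezing_temperature(p, B, s)   # variance is 1, so T_f = 2
x_cold = group_size(1.0, T_f)         # below freezing: 0.5
x_hot = group_size(3.0, T_f)          # above freezing: 1.0
```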

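To examine freezing to the target conformation, we must examine the conditions under which
$y > 0$. Since $y$ varies from 0 to $n$, what has physical meaning in the $n \to 0$ limit is only
the term of the free energy linear in $y$. Therefore, the free energy optimum corresponds to either
$y = 0$ (non-target phase) or $y = n$ (target phase). To find the corresponding critical
temperature, we must examine the slope of the free energy at the point $y = 0$ to determine
whether $y = 0$ or $y = n$ is the stable solution [?]. The condition "slope" $= 0$ yields the
relationship:

$$\frac{s}{x} = \mathrm{Tr}\left[ \frac{\hat{\Delta}\hat{B}}{T}\frac{\hat{\Delta}\hat{B}^p}{T_p} + 2\frac{\hat{P}\hat{B}}{T}\frac{\hat{\Delta}\hat{B}^p}{T_p} + \frac{\hat{\Delta}\hat{B}^p}{T_p}\frac{\hat{\Delta}\hat{B}}{T} + 2\frac{\hat{P}\hat{B}^p}{T_p}\frac{\hat{\Delta}\hat{B}}{T} - \frac{x}{T^2}\left( \hat{\Delta}\hat{B}\hat{\Delta}\hat{B} + 2\hat{P}\hat{B}\hat{\Delta}\hat{B} \right) \right] = \frac{2}{TT_p}\langle \hat{B}^p\hat{B} \rangle_c - \frac{x}{T^2}\langle \hat{B}\hat{B} \rangle_c. \tag{16}$$

This equation defines the phase boundary of the target phase, in which the system freezes
to the target conformation.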



We combine eqs. (13)-(16) to get the boundary of the target phase (i.e. the preparation
temperature $T_p$ which separates the target phase from the random and frozen globule
phases). To write the result, it is convenient to define formally the value of $T_{pf}$ according
to the equation

$$T_{pf}^2 = \langle \hat{B}^p\hat{B}^p \rangle_c / s, \tag{17}$$

similar to eq. (15) except that it includes the preparation matrix $\hat{B}^p$ instead of $\hat{B}$. Physically, $T_{pf}$
is the temperature at which a random heteropolymer would undergo the freezing transition
provided its conformations were governed by the $\hat{B}^p$ interaction matrix; $T_{pf}$ gives a natural scale
for $T_p$:
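$$\frac{T_p^{(cr)}}{T_{pf}} = \begin{cases} \dfrac{2g}{T/T_f + T_f/T} & \text{for } T > T_f \\ g & \text{for } T < T_f \end{cases} \tag{18}$$

As far as the small $s$ limit is concerned, this can also be rewritten in terms of $T_{tar}$, the acting
temperature at which the random-to-target phase transition occurs:

$$\frac{T_{tar}}{T_f} = \frac{g\,T_{pf}}{T_p} \left[ 1 + \sqrt{1 - \left( \frac{T_p}{g\,T_{pf}} \right)^2} \right]. \tag{19}$$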



This is the previously obtained result for the transition to the target phase [16], except with
the inclusion of a factor $g$, which is defined as

$$g \equiv \langle \hat{B}^p\hat{B} \rangle_c \Big/ \left[ \langle \hat{B}\hat{B} \rangle_c \langle \hat{B}^p\hat{B}^p \rangle_c \right]^{1/2}. \tag{20}$$

To understand the meaning of $g$, it must be noted first of all that $g$ can be treated as a scalar
product, $g = \cos\theta$, and thus $-1 \le g \le 1$. This factor gives the degree of correlation between
the elements of the two matrices, $\hat{B}$ and $\hat{B}^p$. If the two matrices are the same (i.e. completely
correlated), $g = 1$. The region $0 < g < 1$ corresponds to a somewhat lesser, but still positive,
degree of correlation; $g = 0$ means that the matrices are statistically independent; $-1 < g < 0$
corresponds to some anti-correlation; finally, $g = -1$ means absolute anti-correlation (each
pair of monomers which is supposed to be attractive in $\hat{B}$ is repulsive in $\hat{B}^p$, and vice
versa, etc.). To see this, it is helpful to note that the definition of matrix cumulants has
the property that $\langle \hat{B}^{(1)}\hat{B}^{(2)} \rangle_c$, where $\hat{B}^{(1)}$ and $\hat{B}^{(2)}$ are each either $\hat{B}$ or $\hat{B}^p$, does not change upon
adding a constant to all matrix elements (this can be easily proven given that $\sum_i p_i = 1$);
this allows one to define $b_{ij} = B_{ij} - \langle \hat{B} \rangle_c$ and $b^p_{ij} = B^p_{ij} - \langle \hat{B}^p \rangle_c$, and to rewrite eq. (20) as
$g = \langle \hat{b}^p\hat{b} \rangle_c / [\langle \hat{b}\hat{b} \rangle_c \langle \hat{b}^p\hat{b}^p \rangle_c]^{1/2}$. It is
now seen that, for example, $g = -1$ corresponds to $b_{ij} = -b^p_{ij}$.

According to eq. (18), the value $T_p^{(cr)}$ is proportional to $g$. At $g = 1$, we recover the
result for $\hat{B}^p = \hat{B}$ [?]. The appearance of positive, but not absolute, correlation ($0 < g < 1$)
has a simple graphical meaning on the phase diagram, Fig. 1: it leads to an affine deformation
of the boundary of the target phase region on the phase diagram. At $g = 0$, the target region
disappears, and it does not exist at $g < 0$. This is clear, because when the matrices are
anti-correlated, "design" does not help, but rather destroys the chances of the polymer to fold into
the desired conformation. Thus, the correlation between matrices, given by the factor $g$, is the
measure of the proximity between interaction matrices.
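The behavior of $g$ under different kinds of error can be illustrated numerically. The sketch below assumes that $g$ is the composition-weighted correlation coefficient built from the second matrix cumulant, $g = \langle \hat{B}^p\hat{B} \rangle_c / [\langle \hat{B}\hat{B} \rangle_c \langle \hat{B}^p\hat{B}^p \rangle_c]^{1/2}$; the function names, the three-species composition and all matrix entries are made up for illustration.

```python
import math

def second_cumulant(p, A, B):
    """<AB>_c = sum_ij p_i p_j A_ij B_ij - <A>_c <B>_c."""
    n = len(p)
    mean_a = sum(p[i] * A[i][j] * p[j] for i in range(n) for j in range(n))
    mean_b = sum(p[i] * B[i][j] * p[j] for i in range(n) for j in range(n))
    raw = sum(p[i] * p[j] * A[i][j] * B[i][j] for i in range(n) for j in range(n))
    return raw - mean_a * mean_b

def correlation_g(p, Bp, B):
    """Normalized correlation between design matrix Bp and folding matrix B."""
    norm = math.sqrt(second_cumulant(p, B, B) * second_cumulant(p, Bp, Bp))
    return second_cumulant(p, Bp, B) / norm

def affine(M, scale, shift):
    """Return scale * M + shift applied to every element."""
    return [[scale * m + shift for m in row] for row in M]

# Made-up design matrix for three species.
p = [0.5, 0.3, 0.2]
Bp = [[-1.0, 0.4, 0.2], [0.4, 0.3, -0.6], [0.2, -0.6, 0.8]]

# Systematic (multiplicative plus additive) error: g stays at 1.
g_systematic = correlation_g(p, Bp, affine(Bp, 2.5, 0.7))

# Sign-reversed interactions: g = -1.
g_reversed = correlation_g(p, Bp, affine(Bp, -1.0, 0.3))

# Error in a single matrix element: g drops below 1.
B_noisy = [row[:] for row in Bp]
B_noisy[0][0] += 0.3
g_noisy = correlation_g(p, Bp, B_noisy)
```

Only errors that change the relative pattern of the matrix elements lower $g$; a uniform rescaling or shift of all energies leaves it untouched.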



V. DISCUSSION 

By performing explicit calculations for the freezing transition of heteropolymers with
different matrices for design and renaturation, we have found three phases: random, in which
many conformations dominate equilibrium; frozen, where the polymer freezes to a single
conformation other than the target conformation; and target, in which the polymer freezes
to the target conformation. In the flexible chain limit, for the case where the design and
renaturation matrices are different, the effective critical selective temperature for renaturation
to the target phase becomes modified by a factor given by the normalized correlation between
the matrices ($T_p^{(cr)} \to T_p^{(cr)} g$). For complete correlation, $g = 1$. When the design
and renaturation matrices differ ($g < 1$), special measures must be undertaken in order to keep the
system in the target phase; otherwise, there is no possibility of obtaining renaturation to the
correct target conformation.

To understand better the meaning of the result obtained, suppose that proteins have
indeed been "designed" according to one of the theoretical models [?]. This means that
they were prepared under the interactions $\hat{B}^p$ at some temperature $T_p < T_{pf}$, and now they
"work" at some other temperature $T$, such that $T_{pf} < T < T_{tar}$. It is worth stressing that
their "work" is governed by their natural interactions, that is, by the same matrix $\hat{B}^p$ as
was supposedly used for "design." We now take some other, artificial matrix $\hat{B}$ and try (for
example, by means of computer simulation) to recover the correct renaturation. In terms
of our phase diagram, Fig. 1, correct renaturation occurs when and only when the system
remains in the target phase. This is illustrated graphically in Fig. 1: the "Natural" phase
diagram is presented there with solid lines, and real proteins are supposedly represented by
the point within the target phase region. Phase diagrams of a couple of artificial systems, with
sequences designed by the natural matrix $\hat{B}^p$ and conformations governed by mistaken matrices
$\hat{B}$, are shown with dashed lines; a greater degree of error in the potentials pushes the phase
boundary to the left, as determined by the value of the factor $g$ (defined by (20)). In the
example illustrated in the figure, the representative point for matrices with $g > 0.95$ remains
within the target phase region, while for those with $g < 0.95$ it does not. In
the first case, correct renaturation can be recovered; in the second case this is impossible.
We conclude that correct renaturation is possible when the degree of correlation between
$\hat{B}^p$ and $\hat{B}$ is sufficient, namely, when

$$g > g^* \quad \text{(the condition for correct renaturation)}, \tag{21}$$



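where $g^*$ is defined from the condition that the boundary $T_{tar}$, given by equation (19),
goes through the given point $(T_p, T)$:

$$g^* = \frac{T_p}{T_{pf}} \cdot \frac{1}{2} \left[ \frac{T}{T_f} + \frac{T_f}{T} \right]. \tag{22}$$

This can also be instructively rewritten as

$$g^* = 1 + \frac{T_p - T_p^{(cr)}}{T_p^{(cr)}}, \tag{23}$$

where $T_p^{(cr)}$ is the critical preparation temperature of the natural system ($g = 1$).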

Note that the ratio $T_p/T_{pf}$ can serve as a measure of the degree of selection of sequences:
smaller values of $T_p/T_{pf}$ correspond to stronger selection of sequences. At the same time,
$T_p^{(cr)}/T_{pf}$ measures the degree of necessary selection at the given actual temperature
$T$ (because $T_p^{(cr)}$ depends on $T$, equation (18)). We conclude that the minimal "correctness"
of the interaction matrix, $g^*$, is defined by the degree of selection, or optimization, of the set of
real sequences: the better they have been optimized, the more stable is their renaturation
with respect to mistakes in the interactions.
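To make the dependence of $g^*$ on the degree of selection concrete, the following sketch evaluates the target-phase boundary and the minimal correlation. It assumes that the boundary has the form $T_p^{(cr)}/T_{pf} = 2g/(T/T_f + T_f/T)$ above $T_f$ and $T_p^{(cr)}/T_{pf} = g$ below it; the function names and the working temperature $T/T_f = 1.4$ are illustrative choices, not values taken from the text.

```python
def critical_preparation_ratio(t, g=1.0):
    """T_p^(cr) / T_pf as a function of t = T / T_f (assumed boundary form)."""
    if t <= 0.0:
        raise ValueError("t = T/T_f must be positive")
    if t > 1.0:
        return 2.0 * g / (t + 1.0 / t)
    return g

def minimal_correlation(tp_ratio, t):
    """g* for which the boundary passes through the point (T_p/T_pf, T/T_f)."""
    return tp_ratio / critical_preparation_ratio(t, g=1.0)

# Hypothetical operating point: sequences selected at T_p/T_pf = 0.9,
# folding simulated at T/T_f = 1.4 (an arbitrary illustrative value).
g_star = minimal_correlation(0.9, 1.4)

# Stronger selection (smaller T_p/T_pf) tolerates a less accurate matrix.
g_star_strong = minimal_correlation(0.7, 1.4)
```

In this example the weaker selection requires $g^*$ of roughly 0.95, while the stronger selection lowers the requirement to roughly 0.74.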

Speaking about the numbers involved in the problem, we have to stress that the available
information is by far insufficient to make any solid statements. To get a rough idea,
we can proceed in the following way. When extracting a matrix of species-species energies
for proteins from the statistics of the protein data bank, such as the Miyazawa and Jernigan
(MJ) matrix, what one obtains is actually $B^{MJ}_{ij} = B^p_{ij}/T_p$ (see Ref. |^ and Appendix B).
Since from equation (15) we have $\langle (B^p)^2 \rangle^{1/2} = s^{1/2} T_f^p$, the variance of the MJ
matrix yields $\langle (B^{MJ})^2 \rangle^{1/2} = s^{1/2} T_f^p / T_p$; therefore, with the knowledge of the MJ matrix
($\langle (B^{MJ})^2 \rangle \approx 2.0$) and the flexibility of proteins s ($s \approx 1.6$), we also arrive at $T_p/T_f^p \approx 0.9$.
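
The arithmetic behind this estimate can be sketched in a few lines. This is only an illustration of the relation $\langle (B^{MJ})^2 \rangle^{1/2} = s^{1/2} T_f^p / T_p$ quoted above; the function name `selection_ratio` is a hypothetical helper, and the inputs are the rough values given in the text.

```python
import math


def selection_ratio(s: float, mj_second_moment: float) -> float:
    """Return T_p / T_f^p from <(B^MJ)^2>^(1/2) = s^(1/2) * T_f^p / T_p.

    s                -- flexibility (entropy per monomer), as quoted in the text
    mj_second_moment -- <(B^MJ)^2>, second moment of the MJ matrix
    """
    return math.sqrt(s / mj_second_moment)


# Rough values quoted in the text: s ~ 1.6 and <(B^MJ)^2> ~ 2.0
ratio = selection_ratio(1.6, 2.0)
print(f"T_p / T_f^p ~ {ratio:.2f}")
```

With these inputs the ratio is the square root of 0.8, consistent with the value of about 0.9 stated above.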



It is also independently hypothesized that the ratio of the "folding" to the "glass"
temperature should be about 1.6. Without going into the arguments of Ref. [26], we
can, quite arbitrarily, identify the "folding" temperature with T_tar and the "glass" temperature with
T_f^p; by doing so, we obtain T_tar/T_f ≈ 1.6, which, in view of equation (18), yields a
similar estimate for the degree of optimization, T_p/T_f^p ≈ 0.9. We conclude that a conservative
estimate of g* is likely to be about g* ≈ 0.95: correct recovery of the native state requires
g > 0.95; if g < 0.95, the chances of correct renaturation are slim.

Yet another, though purely algebraic, aspect of the problem is which types and values
of errors in the determination of interactions lead to a particular value of the g factor, such as, for
instance, the value 0.95 mentioned above. We first of all note that neither additive (B_ij = B^p_ij + B_0)
nor multiplicative (B_ij = βB^p_ij) systematic errors contribute at all: g = 1 in both
cases (as well as in the "combined" case B_ij = βB^p_ij + B_0). This is clear physically because
these kinds of errors contribute to homopolymer terms only and do not affect the selectivity of
interactions of monomers with one another. On a more formal level, additive constants do not
change second moments of matrices defined according to eq (|l^) , and multiplicative constant 



obviously does not affect the value of g (pO]). To get an idea about random mistakes, we 
examine the case where the renaturation matrix is the design matrix with some normally
distributed noise η_ij: B_ij = B^p_ij (1 + η_ij), where P(η_ij) ∝ exp[−η_ij²/σ²]. We can average the
g factor over the noise to get



B^ 



1/2 



This gives σ ~ 10%. More complicated random, systematic, and mixed errors may be
interesting to model, and this can be easily accomplished within this formalism, but the
results depend on the specific nature of these errors and are therefore outside the
more general scope of this paper.
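
The statements about systematic errors can be illustrated numerically. The sketch below is not the definition of g used in this paper: it assumes, purely for illustration, that g is the normalized covariance (connected second moment) of the independent matrix elements taken with uniform weights, and that β > 0. The design matrix, the number of species, and all parameter values are hypothetical.

```python
import math
import random


def g_factor(bp, b):
    """Normalized covariance of two equally long lists of matrix elements.

    Assumption: uniform weights over the elements; the paper's own
    definition of g may weight species by composition.
    """
    n = len(bp)
    mean_p, mean_b = sum(bp) / n, sum(b) / n
    cov = sum((x - mean_p) * (y - mean_b) for x, y in zip(bp, b))
    var_p = sum((x - mean_p) ** 2 for x in bp)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_p * var_b)


rng = random.Random(0)
q = 20  # hypothetical number of species
# Independent elements B^p_ij (i <= j) of a hypothetical design matrix
bp = [rng.gauss(0.0, 1.0) for _ in range(q * (q + 1) // 2)]

# Combined systematic error B = beta * B^p + B_0 (beta > 0)
beta, b0 = 1.7, -0.4
g_affine = g_factor(bp, [beta * x + b0 for x in bp])

# Random multiplicative error B = B^p (1 + eta), P(eta) ~ exp(-eta^2 / sigma^2),
# i.e. a Gaussian with standard deviation sigma / sqrt(2)
sigma = 0.3
g_noisy = g_factor(bp, [x * (1.0 + rng.gauss(0.0, sigma / math.sqrt(2))) for x in bp])

print(f"systematic error: g = {g_affine:.6f}")
print(f"random error:     g = {g_noisy:.6f}")
```

Under this assumed definition the systematic case returns g = 1 up to rounding, while random noise lowers g below 1; the specific noisy value depends on the assumed weighting and should not be read as a reproduction of the estimate in the text.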

It is worth making very clear that this error limit is independent of the length of the
polymer. Previous calculations [1] have made estimates which depend directly on N (i.e.
the error must be small compared with 1/√N). This reflects a fundamental difference between our
approach and that of Ref. [1]; even more, it reflects the difference between the questions
studied. In Ref. [1], calculations were performed for random heteropolymers;
neither any kind of design nor the principle of minimal frustration was imposed. Accordingly,
the question studied there was in fact about the possibility of reconstructing a randomly chosen
conformation of the frozen globule phase. In the ensemble of random sequences, the ones with
a very stable ground state are very (exponentially) rare; thus, it is not surprising that these
ground states are typically very unstable with respect to error-based renaturation, especially
for long chains. By contrast, our treatment is a comparison between the types of freezing (to
the target or to some random conformation). It is therefore independent of the length of the
polymer chain and essentially of a different nature than that of Ref. [1]. Furthermore, within
our formalism, the transition in y, the number of replicas in the target group, is first order;
therefore, in the framework of our approach one cannot discuss the "degree" of renaturation
in terms of a given percentage of correct contacts: in thermodynamic equilibrium, and for a very
long chain (thermodynamic limit), there is either renaturation to the target conformation or
folding to some entirely different conformation.

Within the framework of our formalism, the Independent Interaction Model can be
recovered by addressing the limit q ~ N and assuming that B is a normally distributed
matrix; in this case eq. (15) agrees with the results of more direct calculations of this model
P). The error limits in this approximation are derived in exactly the same manner as 
(p^). This is not surprising as, in fact, the validity of the approximation of taking the free 
energy to 0{B'^) in eq ( [T^ ) is similar to that of the Independent Interaction Model [Q: we 
assume that the effective flexibility s of the polymer is small. However, our treatment allows 
corrections to this approximation to be systematically derived. 

In conclusion, starting from the most general Hamiltonian involving short range binary
heteropolymeric interactions, we have derived the measure to be used to compare differences
in interaction potentials and the limits within which renaturability to the target conformation is
still allowed. Simple estimates for normally distributed errors indicate that even conservative
estimates leave room for 10% error in potentials. Using our formalism, one can make a more
informed estimate based upon more precise knowledge of the form of errors involved, i.e.
the correlations of errors in the matrix.
APPENDIX A: SIMPLIFICATION OF EQUATION (7)

We will use a slightly different notation from the rest of the paper to facilitate calculations:
we eliminate indices and simply give the dimensionality of the operators explicitly, e.g. we
label A_ij as A^(q) since it is a q × q dimensional matrix.

We perform the simplification by eliminating replicas through several steps:

1. q is of the well-known one-step replica symmetry breaking shape, with one distinct group
of y + 1 replicas and (n − y)/x groups of x replicas each.

2. M = I + 2p q ⊗ A B can be viewed as an (n + 1) × (n + 1) block matrix in replica space,
with each matrix element being a q × q matrix in species space. This block matrix is of
the same structure as q, with one (y + 1) × (y + 1) super-block and (n − y)/x
super-blocks of size x × x.

3. The determinant in the first term in the free energy is decomposed into the product of
determinants of super-blocks.

4. Vector p is composed of n + 1 "blocks" p_i, thus making the second term in the free energy
the sum of independent contributions from the groups of replicas. Along with the previous
step, this means that different groups of replicas do not interact, and this is why they
contribute independently to the free energy.

5. The effective replica energy E is now presented in the form

$$ E(x,y) = \epsilon_y + \frac{n-y}{x}\,\epsilon_x , \tag{A1} $$

where ε_y and ε_x are the (independent) contributions from the corresponding groups of
replicas. (Note that the replica entropy is also of the same form.)

6. Both ε_y and ε_x have almost the same form as E (7), except that a simpler matrix q̃, with
all matrix elements equal to 1, appears instead of q:
= - ln det
2 



+ 



-1 



(A2) 



where z is either x or y + 1, i.e., the number of replicas in the group.
7. To simplify the first term (with the determinant), we define the unitary rotation operator



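$$ \mathcal{R}^{(z)}_{\alpha\beta} = \frac{1}{\sqrt{z}} \exp\left[ \frac{2\pi i}{z} (\alpha-1)(\beta-1) \right] , \qquad 1 \le \alpha, \beta \le z . \tag{A3} $$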



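It is easy to check that this operator transforms q̃ into diagonal form, where one
diagonal matrix element is nonzero, while all others are 0:

$$ \mathcal{R}^{(z)} \tilde{q}^{(z)} \left( \mathcal{R}^{(z)} \right)^{-1} = \Lambda^{(z)} , \quad \text{where} \quad \Lambda_{\alpha\beta} = z\, \delta_{\alpha 1} \delta_{\beta 1} . \tag{A4} $$

We define also $\mathcal{R}^{(zq)} = I^{(q)} \otimes \mathcal{R}^{(z)}$ and note that the determinant is not changed upon
rotation. We write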



det 
det 
det 
det 
det 



det 



+ 2p (7^("^)) q-"^ ® a(«) 5(^5) in 



det 



7{zq) 



-1" 



X^q) 



(A5) 



As ij^^*?) is diagonal in replica space, i?^^'?) = I3j^^5af3, we have 



^{zq)\ j^(zq) f^(zq) 



2Tci 



75 ' V / 



-1 

5/3 



II exp 



(a-/3)(7-l) 



(A6) 



Taking into account the simple structure of A ([A^), we arrive at 



— (l-/5)(7-l) 

z 



A^q)m (A7) 
8. First consider a non-target group of z = x replicas. In this group, all the replicas are
identical, meaning that B_γ^(q) = B^(q) does not depend on the replica index γ. This yields



and thus 



det 



det 



(<?) 



(A9) 



9. Consider now the target group of z = y + 1 replicas. In this case,
B_γ^(q) = B^p(q) for γ = 1
and B_γ^(q) = B^(q) otherwise. We write therefore

= T^'^S^p + 5„iA(^) - + iy+ l)5,i5i^A('^)5(^) . (AlO) 

This is a block matrix of the peculiar form such that only the upper block is non-zero
in the first column; for that reason, its determinant is equal to the product of the
determinants of the diagonal blocks (see Lemma 1). Thus,



det 



jiiy+m + 2pq 



det 



(All) 



10. As to the second term in (|A^), it is easily computed using Lemma 2. Indeed, 
jg block diagonal matrix with one block i?^''^ and y others B^'^\ On the other 
hand, g^^^^"* ® A^''^ is the block matrix with every block being the same A^''^ Therefore, 



the matrix in question. 



^ ( z) 

is exactly of the form VI- form. 



where g = 2p A^(q) B^p(q) and h = 2p A^(q) B^(q). Using the block matrix multiplication rule, it is
easy to compute the inverse (see Lemma 2) and then to use the result of Lemma 3. This finally gives

1 
P 



P 



p(p 



-1 









(A12) 
11. A similar expression for a non-target group of x replicas can be derived from here by
formally putting B^p(q) → B^(q) and y + 1 → x; this gives



1 - 
~\P 
P \ 



pip 



P 

(A13) 
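
The diagonalization used in step 7 can be checked numerically. The sketch below assumes that the rotation operator is the normalized discrete Fourier matrix, with prefactor 1/√z so that it is unitary, and that q̃ is the z × z matrix with all elements equal to 1. The indices are zero-based in the code, so the surviving diagonal element sits at position [0][0]; the value z = 5 is arbitrary.

```python
import cmath


def rotation(z):
    """R[a][b] = z**-0.5 * exp(2*pi*i*a*b/z), with zero-based indices a, b."""
    return [[cmath.exp(2j * cmath.pi * a * b / z) / z ** 0.5 for b in range(z)]
            for a in range(z)]


def matmul(x, y):
    """Plain matrix product of two lists of lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def dagger(x):
    """Hermitian conjugate; equals the inverse for a unitary matrix."""
    return [[x[j][i].conjugate() for j in range(len(x))] for i in range(len(x[0]))]


z = 5
ones = [[1.0] * z for _ in range(z)]      # the matrix with all elements 1
r = rotation(z)
unit_check = matmul(r, dagger(r))          # should be the identity
rotated = matmul(matmul(r, ones), dagger(r))

for row in rotated:
    print(["%.3f" % abs(v) for v in row])
```

In this sketch the only element of the rotated matrix that does not vanish (up to rounding) is the first diagonal one, and it equals z, the single nonzero eigenvalue of the all-ones matrix.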



Lemma 1. 
Consider an auxiliary problem of the matrix 

This is a block matrix, where g is a q × q matrix and I is the identity matrix of the same size
q × q. The question is to find the determinant of this matrix.

It can be shown by expansion over the elements of the first column, then over the elements
of the first column of the remaining minor, and by repeating this operation q times, that



det 



det^ 



(A14)



independently of the blocks placed in the upper-right triangle (shown conventionally with 
question marks). 
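
Lemma 1 can be checked with exact rational arithmetic. Since the displayed matrix is not reproduced here, the sketch assumes one reading of it: a block upper triangular matrix with diagonal blocks g, I, ..., I, zero blocks below the diagonal, and arbitrary blocks above it. The sizes q = 3, z = 4 and the random entries are hypothetical.

```python
from fractions import Fraction
import random


def det(m):
    """Determinant by Gaussian elimination over exact rationals."""
    m = [row[:] for row in m]
    n, sign, result = len(m), 1, Fraction(1)
    for c in range(n):
        pivot = next((r for r in range(c, n) if m[r][c] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != c:
            m[c], m[pivot] = m[pivot], m[c]
            sign = -sign
        result *= m[c][c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            for k in range(c, n):
                m[r][k] -= factor * m[c][k]
    return sign * result


rng = random.Random(1)
q, z = 3, 4


def rand():
    return Fraction(rng.randint(-5, 5))


g = [[rand() for _ in range(q)] for _ in range(q)]

# Assumed structure: diagonal blocks g, I, ..., I; arbitrary blocks above; zeros below.
big = [[Fraction(0)] * (z * q) for _ in range(z * q)]
for a in range(z):
    for b in range(a, z):
        for i in range(q):
            for j in range(q):
                if a == b:
                    value = g[i][j] if a == 0 else Fraction(int(i == j))
                else:
                    value = rand()  # the "question mark" blocks
                big[a * q + i][b * q + j] = value

det_big, det_g = det(big), det(g)
print(det_big == det_g)
```

The comparison is exact because all entries are rationals, so the equality does not depend on a numerical tolerance.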

Lemma 2. 

Consider another auxiliary problem of the following block matrix: 

Here g and h are q × q matrices; they generally do not commute with each other. I is the
identity matrix of the same size q × q. The total size of the block matrix $\mathcal{M}^{(z)}_{g,h}$ is, therefore,
zq × zq. The question is to find the inverse of the matrix $\mathcal{M}^{(z)}_{g,h}$.

It turns out that this inverse is in fact a matrix of the same structure, namely

$$ \left( \mathcal{M}^{(z)}_{g,h} \right)^{-1} = \mathcal{M}^{(z)}_{e,f} , \quad \text{where} \quad e = -\left[ I + (z-1)h + g \right]^{-1} g \quad \text{and} \quad f = -\left[ I + (z-1)h + g \right]^{-1} h . \tag{A15} $$

The result can be easily proved using block matrix multiplication rule. 
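
A direct check of Lemma 2 by block multiplication can be sketched as follows. The block structure is an assumption inferred from step 10: the block in row a and column b is δ_ab I plus g if b is the first column and plus h otherwise. The 2 × 2 matrices g and h below are hypothetical examples chosen so that they do not commute.

```python
from fractions import Fraction as F


def inv2(m):
    """Inverse of a 2 x 2 matrix via the adjugate formula."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul(x, y):
    """Plain matrix product of two lists of lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def assemble(first, rest, z):
    """Block (a, b) is delta_ab * I + (first if b == 0 else rest)."""
    q = len(first)
    out = [[F(0)] * (z * q) for _ in range(z * q)]
    for a in range(z):
        for b in range(z):
            block = first if b == 0 else rest
            for i in range(q):
                for j in range(q):
                    unit = 1 if (a == b and i == j) else 0
                    out[a * q + i][b * q + j] = block[i][j] + unit
    return out


z = 3
g = [[F(1), F(2)], [F(0), F(1)]]
h = [[F(0), F(1)], [F(1), F(0)]]  # g and h do not commute

# K = I + (z - 1) h + g, then e = -K^{-1} g and f = -K^{-1} h as in (A15)
k = [[(1 if i == j else 0) + (z - 1) * h[i][j] + g[i][j] for j in range(2)]
     for i in range(2)]
k_inv = inv2(k)
e = [[-v for v in row] for row in matmul(k_inv, g)]
f = [[-v for v in row] for row in matmul(k_inv, h)]

product = matmul(assemble(g, h, z), assemble(e, f, z))
identity = [[F(int(i == j)) for j in range(2 * z)] for i in range(2 * z)]
print(product == identity)
```

The same result follows from writing the assumed matrix as I + UV, with U a column of z identity blocks and V = (g, h, ..., h), so that its inverse is I − U(I + VU)^{-1}V.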

Lemma 3. 

Consider an auxiliary problem of the scalar product 
where $p^{(zq)} = p^{(q)} \otimes 1^{(z)}$, i.e. $p_{\alpha i} = p_i$ does not depend on the replica index α, and $W^{(zq)}$ is a block
matrix comprised of blocks $W^{(q)}_{\alpha\beta}$. Obviously, this scalar product is reduced to scalar
products of smaller dimensionality q, that is, purely in species space, summed over all the
blocks of the matrix:

$$ p^{(zq)} \cdot W^{(zq)} p^{(zq)} = \sum_{\alpha\beta} p^{(q)} \cdot W^{(q)}_{\alpha\beta}\, p^{(q)} . \tag{A16} $$
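
The reduction stated in Lemma 3 can be illustrated with a small exact example. The sketch assumes that the long vector is simply z copies of the species vector p, and uses hypothetical integer entries for p and for the blocks of W.

```python
import random

rng = random.Random(2)
q, z = 3, 4

# Species vector p and z x z array of q x q blocks W[a][b] (hypothetical integers)
p = [rng.randint(1, 5) for _ in range(q)]
blocks = [[[[rng.randint(-3, 3) for _ in range(q)] for _ in range(q)]
           for _ in range(z)] for _ in range(z)]

# Long vector: the same p repeated for every replica index
big_p = p * z
# Full zq x zq matrix, row index a*q + i, column index b*q + j
big_w = [[blocks[a][b][i][j] for b in range(z) for j in range(q)]
         for a in range(z) for i in range(q)]

# Scalar product in the full replica-species space
full = sum(big_p[r] * big_w[r][c] * big_p[c]
           for r in range(z * q) for c in range(z * q))

# Sum over blocks of scalar products purely in species space
blockwise = sum(p[i] * blocks[a][b][i][j] * p[j]
                for a in range(z) for b in range(z)
                for i in range(q) for j in range(q))

print(full == blockwise)
```

Both sums contain the same terms in a different order, which is all that the lemma asserts.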
APPENDIX B: RELATIONSHIP BETWEEN THE AVERAGE NUMBER OF
SPECIES-SPECIES CONTACTS AND THE INTERACTION MATRIX

Note that this relation can be easily derived directly from our formalism as well. The
Hamiltonian can also be expressed directly in terms of the number of contacts n_ij between
monomers of species i and j: $\mathcal{H} = \sum_{ij} B_{ij} n_{ij}$, where we have previously substituted
$n_{ij} = \sum_{I,J} \delta_{s_I, i}\, \delta_{s_J, j}\, \delta(r_I - r_J)$.

Therefore, the average number of contacts can be directly calculated in terms of the 
derivative of the free energy with respect to Bij. However, at this point, we must indicate 
one point in which we have been a bit cavalier in our previous derivation. Specifically, in 
order to perform the Hubbard-Stratonovich transformation, we have summed over all pairs 
of monomers instead of only the different pairs This overcounting of self-site 

interaction leads to a spurious term in the free energy AB. Excluding this term from the 
free energy, which is equivalent to performing the sum carrying terms in the free 

energy to O(B²), and taking the derivative with respect to B_ij yields

{nij) = PiPj (l - ^) ^ PiPj exp (Bl) 



of Ref. ||2^, i.e. either T_m = T_p for chains in the target phase, T_m = T_f for chains in the
frozen phase, or T_m = T for chains in the random phase.
FIGURES

FIG. 1. Phase diagram for different values of the matrix similarity factor g. We find three phases
for designed globular heteropolymers: Random, in which many (O(e^N)) conformations dominate
equilibrium, much like the equilibrium conformations of a globular homopolymer; Frozen, in which
only a few (O(1)) conformations dominate equilibrium in a glass-like phase; and Target, in which
only the target conformation (native state) is found. For decreasing values of g, the boundary of
the target phase moves to the left. For example, consider the case in which one performs a computer
simulation of protein folding: nature has "prepared" proteins with the matrix of interactions B^p
at some preparation temperature T_p/T_f^p < 1, and one wishes to renature these proteins with some
simulated potentials B at some simulated acting temperature 1 < T/T_f < T_tar/T_f; this desired
pair of preparation and acting temperatures is signified by a circle on the figure. If one could
exactly reproduce the potentials used in Nature, i.e. B = B^p, then g = 1 and the phase behavior
is unchanged from that of naturally renatured proteins: the circle is within the target phase. If the
potentials used for renaturation are not precisely those used for preparation, i.e. g < 1, then these
errors in the potentials affect the nature of the phases by moving the boundary of the target phase
to the left, thereby shrinking the target phase. If the errors are small compared with the degree
of optimization, renaturation to the target phase is still possible; as shown in the figure, the
circle is still in the target phase for g > 0.95. Physically, the optimization of the native state
("preparation") allows some errors, and this simply leads to a less optimized native state. Once
these errors are large enough to overcome the optimization, the system will no longer renature to
the target phase: in the figure we see that for g < 0.95, the circle is no longer in the target phase.



